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Abstract 

Accounting for the annual climatic variability is a well-known issue for simulation-based studies of 
environmental models. It often requires intensive sampling (e.g., averaging the simulation outputs over 
many climatic series), which hinders many sequential processes, in particular optimization algorithms. 
We propose here an approach based on a subset selection of a large basis of climatic series, using an 
ad-hoc similarity function and clustering. A non-parametric reconstruction technique is introduced to 
estimate accurately the distribution of the output of interest using only the subset sampling. The 
proposed strategy is non-intrusive and generic (i.e. transposable to most models with climatic data 
inputs), and can be combined to most “off-the-shelf” optimization solvers. We apply our approach to 
sunflower phenotype optimization using the crop model SUNFLO. The underlying optimization problem 
is formulated as multi-objective to account for risk-aversion. Our approach achieves good performances 
even for limited computational budgets, outperforming significantly more “naive” strategies. 


1 Introduction 


Using numerical models of complex dynamic systems has become a central process in sciences. In agronomy, 
it is now an essential tool for water resource management, adaptation of anthropic or natural systems to 
a changing climatic context or the conception of new production systems. In particular, in the past two 


decades crop models have received a growing attention (Boote et a 

. 1996 Brisson et al. 

2003 Brun et al. 

2006 Bergez et al. 2013 Brown et al. 2014 

McNider et al. 2014), 

as they can be used to help improve the 

plant performances, either through cultural practices (Grechi et al. 

2012 

Wuetal. 2012 

or model-assisted 

plant breeding (Semenov and Stratonovitch 

2013 Semenov et al. 

2014 

Quilot-Turion et al. 2012). 


IS 


Many times, the objective pursued by model users amounts to solving an optimization problem, that 
find the set of input parameters of the model that maximize (or minimize) the output of interet (cost, 
production level, environmental impact, etc.). Examples of such problems abound with environmental mod¬ 
els, including water distribution systems design (iTsoukalas and Makropoulosl 120141), agricultural watershed 


management (Cools et al. 20111 or the adaptation of cultural practices to climate change (Holzkamper et al. 


20151. In phenotype optimization, (or ideotype design, Martre et al. 2015), plant performance (e.g., yield) 
is maximized with respect to its morphological and/or physiological traits. 

Within the wide range of potential approaches to solve such optimization problems, black-box optimization 


methods have proved to be popular in this context (Maier et al. 2014 Martre et al. 2015 Quilot-Turion 


et al. 2012), as they only require limited expertise in optimization while being quite user-friendly, as they 


are in essence non-intrusive (i.e., they only require evaluations of the model at hand). 
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However, a well-known difficulty, shared by many models users, is to deal with climatic information. Many 
agricultural or ecological models require yearly series of day-to-day measures of precipitation, temperature, 
etc., as input variables. This is particularly crucial for agricultural or ecological models, for which the climate 
has a preponderant impact on the system. To avoid drawing conclusions biased by the choice of a particular 
set (i.e., year) of climatic data, one may either use scenarii approaches (duplicate the analysis for a small 
number of distinct climates), or average the model outputs over a (large) number of climatic datasets. Due 
to the complex plant-climate interaction, identifying scenarii may prove to be a very challenging task, and 
the alternative relies on intensive computation, which rapidly becomes computationally prohibitive if the 
analysis is embedded in a loop, even for moderately complex models. 

A natural solution is to treat the climate as a random variable, which allows the use of the robust (or 
noisy) optimization framework. However, if readily available codes abound for continuous, box-constrained 
parameters and deterministic outputs, solutions become scarce for systems depending on stochastic phenom¬ 
ena. Besides, the problem formulation becomes more complex, as typically risk-aversion preferences need to 
be accounted for. 

The methodological objective of this paper is two-fold. First, we wish to propose a clear optimization 
framework for optimization under climatic uncertainties, and in particular to account for risk-aversion con¬ 
cepts in a transparent manner. Second, as both optimization and uncertainty analysis are computationally 
intensive tasks, we need to provide an algorithmic solution to solve the problem in reasonable time. In 
addition, we wish to remain non-intrusive and generic (i.e. transposable to most models with climatic data 
inputs). Finally, in order to facilitate the use of parallel computing, we aim at limiting the complexity of 
the algirthm to its minimum. 

In this work, we focus on the problem of sunflower ideotype design using the SUNFLO crop model. 
SUNFLO is a process-based model which was developped to simulate the grain yield and oil concentration 


as a function of time, environment (soil and climate), management practice and genetic diversity (Casadebaig 


et al. 2011). It allows to assess the performance of sunflower cultivars in agronomic conditions. A cultivar 


is represented by a combination of eight genetic coefficients (see Table [^, which are the variables to be 
optimized. The SUNFLO model computes the annual yield y (in tons per hectare) for a given climatic year. 

The rest of this paper is organized as follow: Section briefly reviews previous works on phenotype 
optimization, describes the SUNFLO model and the multi-objective optimization formulation to solve the 
problem at hand. Section is dedicated to the optimization algorithm, which relies on a subset selection of 
the available climate data combined with a metaheuristic algorithm. Finally, Section provides numerical 
results and compare our approach to classical solutions. 


2 Problem definition 

2.1 Brief review of phenotype optimization 


Martre et al. (2015) provide a review of recent developments in this research domain named model-assisted 


crop improvement or ideotype design. A phenotype is defined as the expression in a particular environmnent 
of a specific genotype through its morphology, development, cellular, biochemical or physiological proper¬ 
ties. An ideotype is defined as a combination of morphological and/or physiological traits optimizing crop 
performances to a particular biophysical environment and crop management. Letort et al. (2008) devel¬ 


opped an approach to design plant ideotypes maximizing yield, using numerical optimization methods on 
coupled genetic and ecophysiological models. However, as most of the developped crop model do not include 


genetic-level inputs (Hammer et al. 2010), optimization mainly targets the phenotype level. 


In the phenotype optimization setting, ideotype design can be formulated as a problem of optimizing 


model inputs related to cultivar practices (Grechi et al. 

|M2 Wu et al. 2012 

), or 

phenotypic parameters 

(Semenov and Stratonovitch, 2013 Semenov et al. 

2014 

Quilot-Turion et al. 2 

012) 

Different purposes are 

targeted such as the adaptation to climate change 

Semenov and Stratonovitch 

20L 

1 Semenov et al. 2014) 

or the multicriterion assessment of cultivar ( 

IJuilot-Turion et al. 

2012 IQi et al. 2010). In most of these 

approaches (Letort et al. 2008 Qi et al. 2010 

Quilot-Turion et al. 

2012), the study has been performed on 
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a constant environment, in particular, using a single climatic year. Quilot-Turion et al. (20121 stated that 


further methodological developments are needed in the optimization side to reduce computational time in 
order to be able to consider multi-environments and large climatic series. In this work, the authors used the 


’Virtual Fruit’ model (Quilot et al. 20051 to design peach phenotypes for sustainable productions systems. 


Their aim is to optimize jointly three model outputs (fruit mass, sweetness and crack density) in four different 
scenarii using one climatic data serie in 2009. They first performed a sensitivity analysis in order to select 


six phenotypic model inputs amongst 60 and use the multi-objective optimization method NSGA-II (Deb 


et al., 2002) in order to solve the problem. 


Semenov and Stratonovitch (2013) proposed to evaluate a phenotype by estimating an expected yield 


using 100 climatic series, by combining the use of the stochastic weather generator LARS-WG (Semenov and 


Stratonovitch 2010) and the wheat crop model Sirius (Jamieson et al. 1998) in order to design high-yielding 


ideotypes for a changing climate in the case of two contrasting situations: Sevilla in Spain and Rothamsted 
in the United Kingdom. Inputs were nine cultivar-dependant parameters related to the photosynthesis, 
phenology, canopy, drought tolerance and root water uptake. The optimization problem was solved by using 


an evolutionary algorithm with self-adaptation (EA-SA, Beyer, 1995). 


2.2 The SUNFLO model 

In this work, we consider the SUNFLO crop model in order to assess the performance of sunflower cultivars in 


agronomic conditions. This model is based on a conceptual framework initially proposed by (Monteith 1977) 


and now shared by a large familly of crop models (Keating et al., 2003 Brisson et al. 2003 Stockle et al. 


2003). In this framework, the daily crop dry biomass growth rate is calculated as an ordinary differential 


equation function of incident photosynthetically active radiation, light interception efficiency and radiation 
use efficiency. Broad scale processes of this framework, the dynamics of leaf area, photosynthesis and biomass 
allocation to grains were split into finer processes (e.g leaf expansion and senescence, response functions to 
environmental stresses) to reveal genotypic specificity and to allow the emergence of genotype x environment 
interactions. Globally, the SUNFLO crop model has about 50 equations and 64 parameters (43 plant-related 
traits and 21 environment-related). 

In this model, a cultivar is represented by a combination of eight genetic coefficients (see Table . 
These coefficients describe various aspects of crop structure or functioning: phenology, plant architecture, 
response curve of physiological processes to drought and biomass allocation. The consequence of genetic 
modifications can be emulated by changing the values of such parameters. We consider here the design of 
sunflower cultivars for a given set of cultural practices and a specific environment. The overall objective is 
to find a phenotype that maximizes the yield for the year to come, without knowing in advance the climate 
data. We assume that the coefficients can take continuous values between a lower and an upper bound, 
which are determined from a dataset of existing cultivars (see Table [^. We denote x G X S M'’* a particular 
phenotype, where d is the number of input variables {d = 8). 

The SUNFLO model computes the annual yield y (in tons per hectare) for a given climatic year. Hence, 
it requires as an additional input a climatic serie, which consists of daily measures over a year of five 
variables: minimal temperature (T([ii„, °Cd), maximal temperature (T^ax) °Cd), global incident radiation 
(i?, MJ/m?), evapotranspiration (E, mm, Penman-Monteith) and precipitations (P, mm) We note: c = 
{Tmin, Tmax) P, P, P}- Figureprovides an example of such data. 

We use historic climatic data from five french locations Avignon, Blagnac, Dijon, Poitiers and Reims (see 
Figure from 1975 to 2012. The initial data is recorded over 365 days, but we consider only the cultural 
year (April to October, 180 days), as the yield computed by the model only depends on this period. We 
denote by D this set of climatic series, and we have C'ord(D) = N = 190 and c G 

To summarize, the yield can be seen as a function of the phenotype and the climatic serie: 


?/ : X X D 
X, c 


y(x,c) 
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Table 1: Phenotypic coefRcients and the bounds used for optimization. 


Symbol 

Description 

Min 

Max 

TDFl 

Temperature sum from emergence to 
the beginning of flowering (°C) 

765 

907 

TDM3 

Temperature sum from emergence to 
seed physiological maturity (°C) 

1540 

1830 

TLN 

Number of leaves at flowering 

22.2 

36.7 

K 

Light extinction coefficient 
during vegetative growth 

0.780 

0.950 

LLH 

Rank of the largest leave 
of leaf profile at flowering 

13.5 

20.6 

LLS 

Area of the largest leave of 
leaf profile at flowering {cwS) 

334 

670 

LE 

Threshold for leaf expansion 
response to water stress 

-15.6 

-2.31 

TR 

Threshold for stomatal conductance 
response to water stress 

-14.2 

-5.81 


50.0- 


47.5- 


45.0- 


42.5- 


Reims 


Dijon 


Lusignan 


• Avignon 

Biagnac 


0 5 

long 


Figure 1: Location of the five French stations for the historic climatic data 
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Figure 2: Dataset of the year 2009, Lusignan. 


With a slight abuse of notations, we also define: 


y(X,c) := , 

y{x,C) := [y{x,ci),...,y{x,cj)f, 

y(X, C) := 


that is, the yield function for a set of inputs, either for a set of phenotypes X = {xi,..., x/} (/ € N*), a set 
of climatic series C = {ci,..., cj} (1 < J < N), or both. 


2.3 A multi-objective optimization formulation for robust optimization 

The objective is to find a phenotype that maximizes the yield for the year to come, without knowing in 
advance the climate data. Let C be the climatic serie of the upcoming year (the upper case denoting a 
random variable); we consider in the following that C is uniformly distributed over the discrete set D. Since 
C is random, the yield y(x, (7) is also a random variable (which we denote in the following ^(x)), which 
makes its direct maximization with respect to x meaningless. 

A natural formulation is to maximize the yield expectation: 

maxE [y(x, C)] = maxE [^(x)], 

xGX xGX 


with here: E [Y (x)] = A q). 

However, in general, a farmer also wishes to integrate some prevention against risk in its decision. Such 


a problem is often referred to as robust optimization in the engineering literature (see for instance [Beyer and 


Sendhoff 2007, for a review). 


A popular solution is to replace the expectation by a performance indicator that provides a trade-off 
between average performance and risk aversion: typically, the expectation penalized by the variance or a 
so-called utility function. The drawback of such approaches is that the trade-off must be tuned beforehand 
by choosing penalization parameters specific to the method. Choosing the appropriate trade-off may not be 
straightforward, and modifying it requires to restart the entire optimization procedure. 

We propose here an alternative, which is to consider this problem as multi-objective, by introducing a 
second criterion to maximize that accounts for the risk (as in Tsoukalas and Makropoulos 2014, for instance). 
One may choose for instance to maximize a quantile: 























































































with the usual definition of the quantile: P [y < QaiY)] = a, and a € (0,0.5]. Here, it amounts to 
maximizing the yield for the (N x a)-th worst year. However, we consider here a close but numerically more 
stable criterion, called the conditional value-at-risk (CVaR, Rockafellar and Uryasev, [2000 ), defined as: 


cvaR„ [y(x)] = E [y(x)|y(x) < [y(x)]]. 


CVaRct is the average yield over the (N x a)-th worst years. 
The multi-objective optimization problem is then: 


max 

E[y(x)] 

max 

CVaRa [T(x)] 

s.t. 

X € X. 


Such a formulation is relatively classical in robust optimization, although the second objective is often 


taken as the variance of the response: var[y(x)] (as for instance in Chen et al. 1999 Jin and Sendhoff 


20031. However, considering an expectation-variance trade-off does not make sense here, as a farmer will not 


want to minimize the variability of its income (i.e., minimizing the variance) but rather minimize the risk of 
low income. 


3 Optimization with a representative subset 


The two objective functions, E[y(x)] and CVaRa [T(x)], require running the SUNFLO simulator N times 
everytime a new phenotype x is evaluated. Embedded in an optimization loop, which typically requires 
thousands to millions calls to the objective functions, this evaluation step becomes prohibitive. 

We propose to address this problem by replacing the large climatic data set fl by a small representative 
set fix- To do so, we first choose the set prior to optimization using a clustering algorithm described 
in Section 3.1 Then, the optimization algorithm is run using fix- Hence, E[y(x)] and CVaRQ[y(x)] are 


replaced by their estimates based on fix, which are described in Section [T2 


3.1 Choosing a representative snbset of climatic data 

3.1.1 Principle 

To select our subset, we propose to define a distance (or, conversely, a similarity) between two climatic series, 
then choose series far from each other using clustering algorithms. 

One can choose to consider only the dataset and define a distance that characterizes differences between 
the time series. However, the drawback of this method is that it is completely model-independent: two 
climatic series can be considered as far from each other but have a similar effect on the model, hence return 
a similar yield. Inversely, two climatic series can be generally close but return different yields because of 
small critical differences (say, a rainy week at an appropriate moment of the plant growth). 

An alternative is to consider a model-based distance: two climatic series would be far from each other 
only if they return different yields for a given phenotype. This naturally implies that all the climatic series 
are run on a (small) phenotype learning set. Therefore, the distance will be very dependent on the choice of 
the set and may result in poor robustness. 

Therefore, we propose here to combine both ideas, and define a hybrid distance that depends on intrinsic 
differences and on the effect on the model. 


3.1.2 Dissimilarity between time series 

As a climatic serie is defined by five time series of different nature, we need first to define a metric to 
compare each series separately. Due to the nature of the data, Euclidian distance can be ruled out, as it 
makes little sense here. Indeed, all the series have important day-to-day variations (corresponding to good or 
bad weather), and similar events can be observed from one series to another shifted by one or several days. 
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Figure 3: Computing of the dtw distance between two time series of maximal daily temperature (Tmax): 
Avignon in 1985 (upper curve, left scale) and Lusignan in 2012 (bottom curve, right scale). Dotted line 
represents the optimal matching of daily temperature computed by dtw, for a window size of 7 days. 



This is particularly apparent for the precipitation series, which contain many zeros and several “peaks”: 
Euclidian distance would consider two series as far from each other, as long as the peaks do not coincide 
exactly. 

A classical tool for time series analysis, sensible in our case, is an algorithm called dynamic time warping 
(DTW, Berndt and Clifford, 1994 Aach and Church] 2001 Kadous 1999). In short, DTW allows two time 
series that are similar but locally out of phase to align in a non-linear manner, by matching events within a 
given window. Note that the DTW algorithm has a 0{n'^) time complexity, which makes the dissimilarity 
computation non-trivial. However, this step should be performed only once. Given two weather series 
and Cj, five distances can be computed, according to the weather variables: d(ci, Cj)^™*”, d{ci, , 


d{ci,Cj)^, d{ci,Cj)^ and d^CijCjY 


3.1.3 Model-based dissimilarity 


This dissimilarity measures a difference in the output of the model (the yield). To do so, we choose first a 
small set of I phenot ypes: B = {xi,..., x;}. Typically, B can be chosen by Latin Hypercube Sampling (LHS, 
“fill” the search space For this basis, the yield is computed for all the climatic 
Then, the model-based distance is simply the Euclidian distance: 


McKay et al. 


series: y{B, D) € 


1979) to 

TxN 


d{ci,cj)^ = 


\ 




k=l 


3.1.4 Combining dissimilarities 

We want here to combine the six dissimilarities (one for each time series and the model-based one) into a 
single one, with equal weight to each variable. We propose to do so by normalizing the dissimilarities before 
summing them with uniform weights. As the variables are of different nature, the dissimilarities distributions 
are likely to be very different (uniform, heavy tailed, etc.), hence artificially weight the variables even if they 
are rescaled similarly. 

Here, we follow a normalization procedure proposed in Olteanu and Villa-Vialaneix| (2015) called “cosine 
preprocessing”, which works as follow: Let D be a A x matrix of dissimilarities (with values dij = d(ii.i,Xj), 
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dij = djj and da = 0). We first compute a corresponding similarity matrix S, with values: 


s^j 2 


N 


N N 


]\l 


T-k]} 


Then, we normalize S with: 


k=l 


Sii — 


k=i fc'=i 




/ o.. _L 

and the normalized dissimilarity matrix D has elements defined as: 

dij — Sii Sjj ‘^Sij — 2 QiSij. 

Now, we use a convex combination of the six normalized dissimilarities: 

OtT^.^d, 




with aT„i„ + ... + aM = 1. In the following, we use = 1/2 and the other weights equal to 1/10. 

3.1.5 Choosing a representative subset using classification 

Once the matrix of dissimilarities A is computed, most unsupervised clustering algorithms can be used to 
split the set of climatic series O into subsets. However, a difficulty here is that the centroids of the clusters 
cannot be computed. Hence, we use a variation of the k-means algorithm that only requires dissimilarities 


Oijijd'. 


JM 


( 1 ) 


to the centroids. We follow the approach described in Olteanu and Villa-Vialaneix (20151; the corresponding 
pseudo-code is given in Algorithm 

The algorithm divides the set Q into K classes ... ,C^, not necessarily of equal sizes. A class 
contains elements {c*,..., c|-}. Any element c G fl is uniquely attributed to one class and we 
have: For each class k, a representative element uj^ is chosen, which we use to define the 

representative set: Up- = {w^,... 


3.2 Non-parametric reconstruction of distributions 

The objective here is to obtain accurate estimations of the objective functions E[y(x)] and CVaRQ[y(x)] 
based on the yield computed for a new phenotype and the representative set: ?/(x, tip-). Since this set is 
small, computing directly the objective functions would lead to large errors, in particular for CVaRQ[R(x)], 
that requires an accurate representation of the tail distribution (see Figure]^. A natural alternative is to fit 
a parametric distribution the small data set, and infer the objectives on the distribution. However, the form 
of the empirical distribution (Figure]^ does not readily call for a given parametric model, and misspecifying 
the distribution shape may result with large bias. 

Hence, we propose to reconstruct the distribution using a non-parametric method, by re-using the data 
computed for the classification step, that is, the yield computed for the phenotype learning basis and all the 
climatic series (j/(R, fl)). 

The general idea is to consider a mixture model for the yield (each component corresponding to a class 

c’^y. 

/F(x)(y) = ^ ^/v'=(x)(y), 2/eK, 

k=l ^ 

f standing for the probability density function (PDF), and Y^{x) being yield within the class k. 

We decompose further F^(x) as the sum of the value at the representative element and a residual: 

y'=(x)=2/(x,a;'=)+e'=(x). 

The intra-class distribution is then characterized by the residuals e*^(x), which determine the form, spread 
(or amplitude), and bias (i.e., difference between the average value and the value of the representative 
element). All these elements vary from one class to another, which advocates the use of non-parametric 
approaches. 









Method 1 (naive) From y{B,n), we first compute the residuals Sj{xi) = y{xi,Cj) — y{xi,uj'^) (1 < i < 1; 
1 < J < 1 < k < K). Then, we average the residuals over the phenotypes of B\ 

1 ^ 

e" = [e'l,e%k] , with £j = (^0- 

i=l 

The intra-class yield variety is re-created by adding the average residual vector to the yield computed for 
the representative value: 

y'=(x)=y(x,cc'=) + £^ 

with i uniformly taken from |l,iV^]. Thus, each component of the mixture has a fixed distribution (i.e. 
independent of x), shifted according to its representative value, and the mixture shape and spread varies 
according to the distribution of the representative values (see Figure]^ for an illustration). 

However, in practice, the values of the residuals can vary substantially from one phenotype to another, 
and averaging them over B tends to destroy the shape information. To address this issue, we proposed the 
following modification: 


Method 2 (rescaled) We introduce first the weighted variance of the yield over the representative set: 


K 


K 




k=l 


N 




Note that for a new phenotype x, the only data available is indeed y{x,u}j), so few alternatives are possible. 
We then define averages of normalized residuals: 




[e\, with 


1 ^ e^lxj) 
I ^ cric(Xi) 


and the yield vector is constructed with: 

y'"(x) = y(x,a;'") -hcrif(x) x ef, 

with i uniformly taken from |l,iV^]. 

Figures and illustrate the reconstruction technique for a given (randomly chosen) phenotype. On 
Figure]^ we see how the estimated distribution is built using the residuals corresponding to each class. We 
can see that the range and shape of the residuals vary considerably from one class to another. Also, their 
distribution around the representative element differs: as the residuals do not have a zero mean, the value 
of the representative element is not necessarily central for each class. Comparing the reconstructed (Figure 
top) and actual (bottom) distributions, we see that the mixture is globally the same on both graphs. 

Figure]^ shows the cumulative distribution function (CDF) of the actual yield and of three estimations: 
using the two methods described above and a simple parametric method, which consists in assuming a 
Gaussian distribution of the yield. The empirical CDF corresponding to the subset values only is also 
depicted, with unequal steps to account for the different number of elements in each class. 

We first notice that the subset data only is obviously insufficient to evaluate accurately the mean or 
the CVaR. Then, we see that the actual distribution does not seem to belong to a known distribution, and 
using a normal distribution introduces a large bias. Inversely, using a non-parametric reconstruction allows 
us to match the shape of the actual distribution. The difference between the two methods is small for this 
example, yet the second approach is slightly better almost everywhere. 

In our study, we found that this second method provided a satisfying trade-off between robustness, 
simplicity and accuracy. Yet, many refinements would be possible at this point, for instance by introducing 
intra-class rescaling (different normalization for each class), bias correction, or using the distance from the 
phenotype x to the basis B. 
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Figure 4: Estimated yield distribution of a given phenotype. The colors show how the reconstruction works: 
each color corresponds to a class fc, and the vertical bars to the representative element of the class. 
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Figure 5: Actual and estimated distributions (CDF) of the yield of a given phenotype. 
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3.3 Optimization and reconstruction update 

Finally, the multi-objective optimization problem solved is: 


max 

E 



1 max 

CVaRa 

i>(x)' 

[ s.t. 

X G X, 



with Y (x) a mixture oiY^ (x),..., Y^ (x). 


One may note that E 


Y (x) and CVaRc Y (x) serve as estimates of E [Y (x)] and CVaRc [Y (x)] , re¬ 
spectively. These estimates are based on the phenotype basis S, which is sampled uniformly over X to 
offer a general representation of the phenotype space. This feature is important at the beginning of the 
optimization to ensure that the optimizer does not get trapped into poorly represented regions. However, as 
the optimizer converges towards the solution, the search space becomes more narrow, and a substantial gain 
in performance can be achieved by modifying the estimates so that they are more accurate in the optimal 
region. 

In theory, it is possible to re-run the entire clustering procedure after a couple of optimization iterations, 
by adding new phenotypes to the learning set. However, such strategy is likely to increase greatly the 
computational burden. We propose instead to modify only the reconstruction step, for which only very few 
additional calculations are required. 

Indeed, the reconstructed yield ditributions use the phenotype learning basis B and their associated 
values By replacing the initial B with B' formed by phenotypes chosen inside the optimal region, 

we obtain yield values y{B', fl) that are more likely to represent the actual distribution within this region. 
Such “specialization” may be to the detriment of the global accuracy of the estimates, but this is not critical 
as the optimizer concentrates on a narrow region. 

Including a new phenotype x' into the basis B requires running the SUNFLO simulator N times to obtain 
j/(x',r2). Therefore, an efficient trade-off must be found between pursuing the optimization and improving 
the estimates. Also, it may be beneficial to discard phenotypes in B that are far from the optimal region. 
In summary, we need to: a) decide when to add phenotypes to the basis and b) when to discard them and 
c) choose which to add / discard. 

A simple strategy is to perform only two steps: first, run the optimization with the initial basis B. Then, 
select I new phenotypes from the obtained Pareto set and replace the entire basis B after running the N x I 
simulations. Finally, restart the optimization with the new estimates. We have found (Section]^ that this 
two-step strategy was sufficient on our problem, while relatively easy to implement. 


3.4 Optimization procedure overview 

To summarize this section. Algorithm describes the complete optimization procedure, including the initial 
clustering and the two-step strategy. Each step relies on the call to a metaheuristic algorithm such as NSGA- 
H (Deb et al. 2002) or MOPSO-CD ( [Raquel and Navffi 2005). Hence, two-step MOPSO-CD stands for the 
tow-step algorithm using the MOPSO-CD metaheuristic. 
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Algorithm 1 Two-step optimization algorithm 

Initialization 

Choose phenotype database B, and compute yield matrix y{B, fl) 

Compute matrix of dissimilarity A 

Run clustering algorithm to obtain the classes and the representative set VIk 

Get residuals from 


Optimization: run 1 

Choose population size q and number of iterations T 

for t= 1, ..., T do 

Select new phenotypes ■ ■ ■ ,^new} according the metaheuristic. 

Calculate yield for the representative set for each new phenotype y{xl^^^,flK) 


) and evaluate E Y (xjj, 


and CVaRr, 


Reconstruct T(x^ 

Post process according the metaheuristic. 

end for 

Get Pareto-optimal solutions X* 


y«eJ 


Optimization: run 2 

Replace S by X*, compute yield matrix y{B, fl) 

Get the new residuals from y{B, SI) 

for t= 1, ..., T do 

Select new phenotypes ■ ■ ■ ,^new} according the metaheuristic. 

Calculate yield for the representative set for each new phenotype j/(x)jg^, Sli^) 


) and evaluate E T(x)jg^) 


and CVaRn 


Reconstruct T(x), 

Post process {x)jg.jg,... ,x®g^} according the metaheuristic. 

end for 

Get Pareto-optimal solutions X* 


YiKew) 


4 Experimental setup 


4.1 Climate subset selection 


In this experiment, we used the R package dtw (Giorgino, 2009) to compute all the distances between climatic 


series. Note that the window size (that is, the maximum shift allowed) is a critical parameter of the method; 
we use here expert knowledge to choose it. For the precipitation, a window of ±3 days is used; for the other 
variables, a window of ±7 days is chosen. The phenotype basis B is chosen as a 10-point LHS; hence, for 
this step the method required 1, 900 calls to the SUNFLO model. 

Once the dissimilarity matrix A is computed, the clustering algorithm (see Appendix]^ is run. Since this 
algorithm amounts to a gradient descent, it provides a local optimum only, so we need to restart it several 
times (by changing the initial values /3q) to ensure that a good optimum is found. We found in practice that 
500 iterations and 10 restarts were sufficient to achieve a good robustness. 

This algorithm does not choose automatically the number of classes K. We found empirically that K = 10 
provided a satisfying trade-off between the representation capability of the subset and the computational 
cost during the optimization loop. 


4.2 Optimization 

To solve the multi-objective optimization problem, we chose to use the MOPSO-CD metaheuristic (Multi- 


Objective Particle Swarm Optimization with Crowding Distance Raquel and Naval, 2005). MOPSO-CD is a 


stochastic population-based algorithm inspired by the social behavior of bird flocking. In short, the algorithm 


12 



















maintains over T generations a population P of individuals (candidate solutions). At each generation, each 
candidate is moved through the search space according to an individual direction (local improvement), a 
global direction (towards the best candidates of the population) and a crowding distance. This distance is 
used in order to build a set of solution that fills uniformly the Pareto front. 

In the following experiments we used the R package dtw (Naval 20131. The two main parameters of 


MOPSO-CD are the population size and number of generations (their product being equal to the number of 
function evaluatuions). 

In order to assess the validity of our approach, we have conducted and empirical comparison to simpler 
approaches: random search and a “naive” optimizer, both using the full set of climatic series. In addition, 
we have conducted an intensive experiment to obtain an accurate representation of the actual Pareto set. 

The intensive experiment consists in running two multi-objective algorithms (NSGA-II and MOPSO-CD) 
with a very large budget (number of calls to the simulator function) using the full set of climatic series. The 
two obtained Pareto fronts are merged to a single one, which we consider as “exact” in the following. We 
set the number of iterations to 300 and the population size to 200, hence computing the exact Pareto front 
requires 2 x 200 x 300 x 190 = 22, 800,000 calls to SUNFLO. 

Random search, or LHS search, is performed using a latin hypercube sampling approach to fill the search 
space X. The naive optimization is performed using the original MOPSO-CD algorithm. Each sampled point 
is evaluated using the entire set of climatic series {N = 190) to estimate the expected yield and CVaR. 

We compare the different approaches based on an equal number of calls to SUNFLO (that is, we do not 
consider the time costs related to each approach). We considered four budgets: large (380,000), medium 
(95,000), small (23,750) and very small (11,400). 

For the naive and two-step approaches, we need to define the number of iterations and the population 
size. We set the number of iterations to approximately five times the popuplation size, except for the very 
small budget where the population size would be to small. For the two-step algorithm, each evaluation of the 
expectation and CVaR requires 10 SUNFLO runs, which allows a larger population and number of iterations 
than the naive approach, but it is also necessary to compute two times y{B, 12) (the simulations of yields 
over all climatic series for the phenotype basis), which has a 10 x 190 cost. 

The different setups are given in Table Note that the budgets are only approximately equal (due to 
rounding issues). Nevertheless the budgets for the two-step approach are always equal or smaller than the 
naive one. 

Since these three optimization approaches are stochastic, each experiment is replicated 10 times, to assess 
the robustness of the results. 

The time cost of one call to the SUNFLO model is low (~ 0.1 sec), which makes it possible to perform 
such an extensive experiment. However, to limit the computational costs, these experiments are performed 
with either a symmetric multiprocessing (SMP) solution based on 30 cores or a message passing interface 
(MPI) implementation based on 40 cores, depending on memory requirements of experiments, which makes 
time costs comparisons meaningless. 


The SUNFLO model has been implemented on the VLE software 

(Quesnel et al. 

2009 

I in the RECORD 

project which is dedicated to agorecosystems study ( 

Bergez et al. 

2013 

). VLE is a multi-modeling and 


simulation platform coded in C-I--I- that provides both a shared memory and a MPI based parallelisation 
for the simulation of multiple input combinations. A native port rvle to the sofware R is available in order 
to call simulations from this statistical tool. The other R packages used are fExtremes (computation of 
CVaR statistic), Ihs (optimized LHS generation), emoa (dedicated tools for multiobjective problems) and 
mco (NSCA-H implementation). Finally, we are grateful to the genotoul bioinformatics platform Toulouse 
Midi-Pyrenees for providing help and/or computing and/or storage resources. 


13 


















Table 2: Experiments performed for the two-step MOPSO-CD algorithm evaluation. 


Optimization 

experiment 

Budget 

Nb of 
iterations 

Pop 

size 

Real nb 
of simulations 

Intensive 

Very 

large 

300(x2) 

200 

~ 2 X 10^ 


very small 

- 

60 

11,400 

Random 

small 

- 

125 

23,750 

(or LHS) 

medium 

- 

500 

95,000 


large 

- 

2,000 

380,000 


very small 

12 

5 

12,350 

Naive 

small 

25 

5 

24,700 

MOPSO-CD 

medium 

50 

10 

96,900 


large 

100 

20 

383,000 


very small 

42(x2) 

9 

11,540 

Two-step 

small 

71(x2) 

14 

23,960 

MOPSO-CD 

medium 

152(x2) 

30 

95,600 


large 

308(x2) 

61 

380,780 


5 Results and discussion 


5.1 Climate subset selection 

We analyze first the classification obtained with our approach. As the classification is based on non-trivial 
distances, it is difficult to characterize each class with integrated quantities (e.g. rainy / hot years, etc.). 
We provide in the following three tools for this analysis. 

We first plot a 2D projection of the climatic series based on the matrix of distances A computed as in 
Section 3.1.4[ To do this, we use the R package cmdscale (Classical Multidimensional Scaling) (Figure |^a). 
Such a representation allows us to see whereas the classes are well-separated, if there are outliers, etc. 

In Figure |^b, the number of climatic series, grouped by their localization is given for each cluster. 
Finally, a decision tree has been learnt (with the R package C50) using the cluster index of climatic series 
as the variable to explain (Figure |^c). We highlight here that this tree is solely for interpretation purpose 
and is not linked to the proposed classification strategy. We used temporal mean aggregation of climatic 
variables {Tmin, Tmax, R, E, P} and the mean yield simulated on the 10 phenotypes in B to build the decision 
tree. 

Based on these three representations, one can conclude that some clusters correspond more or less to 
wheater types from the South of France (Avignon, Blagnac : 0, 5, 7, 9) rather warm (5, 7) or not (0, 9) 
and leading to high yields (5, 9) or not (0, 7). The three clusters 0, 5 and 7 seem indeed the most easy to 
characterize (Figure |^a). 

Cluster 1 represents climatic series leading to low yields from all locations. Clusters 3, 4, 6, 8 correpond 
rather to wheater types from the north of France leading to high yields (3, 4, 6) or not (8). Clusters 2, 4, 
6, 9 can be characterized by a cold weather and high yields but there are difficult to distinguish from each 
other; there is indeed an important mixture of clusters in node 6 in Figure [^c and one can make the same 
observation when studying the projection in Figure |^a. 

While a simple characterisation of clusters can be done, there are still differences between them that 
we do not achieve to characterize, which motivates the approach of using a distance between time series. 
Especially, there is a known high impact of rain episodes and their localization in time, however, the temporal 
mean aggregation of rain is not retained when building these decision trees. 
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cluster index 



Node 4 (n = 26) Node 5 (n = 28) Node 6 (n = 67) Node 8 (n = 38) Node 9 (n = 31) 



0 2 4 6 8 0 2 4 6 8 0 2 4 6 8 0 2 4 6 8 0 2 4 6 8 


Figure 6: (a) Clusters and individuals (the 190 time series) are plotted in a 2D projection using Classical 

Multidimensional Scaling. Each digit represents a weather time series which value corresponds to its cluster. 
Climate series of representative set Q,k are plotted in bold italic, (b) Number of climatic series by cluster 
splitted by localization in France, (c) A decision tree to explain clusters using, for each climatic series, the 
temporal mean values of climatic variables and the mean yield simulated on the 10 phenotypes in B. 
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Figure 7: Pareto fronts obtained with the different methods for the four budgets considered. 


5.2 Phenotype optimization 


5.2.1 Algorithm performance 

Next, we compare the performances of the three approaches. As measuring performance is non-trivial 
in multi-criteria optimization, we use three indicators: hypervolume, epsilon and i ?2 indicators (as recom¬ 


mended in Zitzler et al. 2003 Hansen and Jaszkiewicz 1998), all available in the R package emoa (Mersmann 


20121. They provide different measures of distance to the exact Pareto set and coverage of the objective 
space. In short, the hypervolume indicator is a measure of the volume contained between the Pareto front 
and a reference point (here, the worst value of each objective). The epsilon indicator is a maximin distance 
between two Pareto fronts (here, we use the exact Pareto front as reference), while the i ?2 indicator can be 
seen as an average distance. Figure]^ shows all the Pareto fronts (of the different runs and methods) for the 
different budgets, and Figure]^ shows the corresponding performance indicators in the form of boxplots. 

For the very small budget, we see that no method succeeds at hnding the exact Pareto front. Besides, 
most of the Pareto fronts consists of a single point. However, the two-step approach still largely outperforms 
random search, while a naive use of MOPSO-CD performs worst, as it requires a certain number of iterations 
to hnd a descent direction. 

For the small and medium budgets, the two-step approach consistently finds a good approximation of the 
Pareto front (with the exception of two outliers with the small budget). For the three indicators, it clearly 
outperforms the other approaches. 

For the large budget, we see that the regular MOPSO-CD performs slightly better, which is expected. 
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Figure 8: Performance indices of the different methods for the four budgets considered. 
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Indeed, as soon as there is no necessity of parcimony, using approximate objectives instead of actual ones 
tends to slower, rather than accelerate, convergence. 


5.2.2 Results analysis 


Finally, we characterize the results on the phenotype space. We compare here the exact Pareto set with one 
run of the two-step method; we chose the run on the medium budget with the median performance. For 
readability, we only consider a subset of the Pareto set of size five, equally spaced along the Pareto front. 
The Pareto fronts and sets are represented in Figure]^ 

We can see first that considering both the expectation and CVaR for optimization leads to a large 
variety of optimal phenotypes. Looking back at the plant characteristics corresponding to those solutions, 
the optimum value for five traits had little variability, meaning that those traits were important plant 
characteristics for crop performance in the tested environments. Those five traits depicted plants adapted to 
water deficit: a late maturity (TDM3), a low leaf number (TLN), largest leaves at the bottom of the plant 
(LLH), a small plant area (LLS), and a conservative strategy for stomatal conductance regulation (TR). The 
three other traits (TDFl, K, LE) displayed variability in optimal values, which was identified as the basis 
of the performance/stability trade-off (expectation/CVaR). Here, the traits vary monotonically along the 
Pareto front. 

Four distinct plant types could be identified in the phenotype space. For example, the red plant type 
had an early flowering (TDFl), a low light extinction efficiency (K) and a low plant leaf area (LLS); those 
characterictics correspond to a conservative resource management strategy. In an opposite manner, the 
light-blue type displays a late flowering, a high efficiency to intercept light and a larger plant leaf area, 
characteristics usually associated with a productive but risky crop type when facing strong water deficit 
(Connor and Hall 1997|. The strategy associated with plant types identified from the phenotype space 


matched their position in the Pareto front, i.e the light-blue plant type was more performant but less stable 
than the red one. 

The Pareto set obtained with the two-step method reproduces part of these features: the fixed traits are 
similar (except TLN, which is fixed to approximately 0.5 instead of 0, (but this parameter is known to have 
little impact on the yield, see Casadebaig et al. (20111) and the variation of TDFl and LLS is well-captured. 
However, on this run the method failed at finding the variation of the K and LE traits: this probably explains 
why the largest mean values (left of the Pareto front) are missed. 

Overall, the two-step method allowed to identify the few key traits were responsible for the cultivar global 
adaptation capacity whereas secondary traits supported alternative resource use strategies underlying the 
yield expectation/stability tradeoff. 


6 Summary and perspectives 

In this article, we proposed an algorithm for phenotype optimization under climatic uncertainties. Our 
approach does not require any a priori knowledge on the system besides parameter bounds, hence is usable 
with any simulator depending on similar climatic data. Using subset selection for the climates allowed us to 
reduce substantially the computational time without adding implementation issues. If bias correction seems 
inevitable during optimization, we showed that a two-step strategy was sufficient to achieve convergence: 
this point is critical as it allows our approach to be combined with any black-box multi-objective solver. 

Nevertheless, we see many opportunities for further improvements. First, the distance used here between 
climate series does not account for the fact that agronomical systems are mostly sensitive to a few critical 
periods (e.g., during flowering, grain filling). Weighting the DTW distance using expert knowledge or the 
results of a sensitivity analysis may greatly improve the classification of the climates with respect to their 
impact on the model. 

Second, the reconstruction step may benefit from additionnal study, in particular the effect of the subset 
size, which has been fixed to 10 in our study for practical reasons but could be chosen using preliminary 
experiments for instance. Another interesting topic would be to target the reconstruction to improve the 
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TDF1 TDM3 TIN K LLH LLS LE TR 

Phenotype name 



TDF1 TDM3 TEN K LLH LLS LE TR 

Phenotype name 


Figure 9: Top: exact Pareto front. The bold circles correspond to a subset of five optimal phenotypes; the 
triangles correspond to five phenotypes returned by the two-step method. Middle: optimal phenotype values 
(one curve corresponds to one phenotype). Bottom: phenotype values obtained with the two-step method. 
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quality of the objectives. Indeed, the proposed approach aims a reconstructing the entire output distribution, 
while it is only important to obtain good estimates of the expectation and the CVaR. 

Third, a popular strategy to reduce the computational costs is to combine optimization with the use of 


surrogate modelling (see for instance Di Pierro et al. 2009 Tsoukalas and Makropoulos 2014 for recent ex¬ 


amples). Our approach straightforwardly extends to such approaches, and would result in very parcimonious 
algorithms that may be beneficial for expensive simulations. 

Finally, we have chosen here to use a two-step strategy to allow the use of “off-the-shelf” optimization 
solvers. Interlinking optimization and learning may improve substantially the efficiency of the method, 
although requiring the development of an ad hoc algorithm. 


Appendix: clustering algorithm 

This section details the clustering algorithm used and the rule to chose the representative element of each 
class. The key of this particular approach is that, contrarily to a standard k-means algorithm, we cannot 
compute explicitely a central element (i.e., a “virtual” climatic series). 


Algorithm 2 Clustering algorithm 

Initialize (3 in randomly such that fdij > 0, V*,j and Aj = 1) Vj. Each line (3f. is the 

dissimilarity of the centroid Qk to the climates. 

for t= I, ..., T do 

Pick i randomly in 1,..., iV (one climate selected randomly) 

Assignment step Find j (center closest to Ci) such that 


with Ai the i-th line of A. 

Representation step (update center) 

^ ^ ~ 

where Ij is a vector of zeros except its j-th value equal to one and r{t) = 

end for 


Once f3 has converged, each climate Ci is attributed to the class j, using: 

1 T 

j = arg min (/3fe A,) - -/3fc A/J^ . 

For each class k, a representative element oj^ is chosen. We choose here the most central element in terms 
of dissimilarity. Let A^ be the submatrix of A corresponding to the elements of C^. We choose: 

Ar*= 

oj’^ = c) with / = arg min V 5% 

- - i=i 
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